Lab 1: Visualization & Data Preprocessing¶

Student Exam Scores Extended Dataset
¶


Team Members: Joey Hernandez, Daniel Chang, and Vo Nguyen
¶

Table of Contents¶

  • Business Understanding
  • Data Understanding
    • Import Library
    • Loading the Data
    • Data Description
  • Data Quality
    • Data Inspection
    • Missing Values At A Glance
    • Missing Values Heatmap
    • Addressing Duplicates
    • Addressing Outliers
  • Data Summary Statistics
  • Top Features | Data Visualization
    • 1) Gender
    • 2) Parent Education
    • 3) Ethnic Groups
    • 4) Test Prep
    • 5) Lunch Type
  • Exploration of Attribute Relationships
    • Cross Tabulation
    • Correlation Heatmaps
    • Visual Exploration
  • Creating Additional Features
    • Pass/Fail
  • Exceptional Work
    • Multiple Correspondence Analysis
  • Conclusion

Business Understanding [↑](#toc)¶

The dataset (fictional) comprises scores from three tests of students at a public school, as well as personal and socio-economic factors that may have interaction effects. The purpose of collecting this data is to identify potential relationships between personal and socio-economic factors and test scores. By understanding which factors have the most significant impact on test scores, educators, parents, and students can focus their attention on variables that could potentially help students earn higher scores. This information can be used to inform educational policies and practices, as well as to develop targeted interventions to support students who may be struggling academically. Ultimately, the goal of collecting this data is to improve educational outcomes and promote academic success for all students.

Student education and testing scores are critical to their future success. Education provides the foundation upon which students will build their future, and improving the efficiency of learning and testing experiences can have a positive impact on many students' futures. By analyzing the dataset, we can identify factors that influence student performance and develop strategies to improve academic outcomes.

Our metric of success in the data mining process can be assessed with whether or not we are able to uncover patterns and/or significant relationships that help us to better understand student performance factors. This could include identifying significant predictors of test scores, understanding the impact of certain attributes on student scores, etc. By evaluating the results of our analyses in light of the original research questions, we can determine whether we have gained useful insights from our data mining process which can inform educational policies and practices and promote academic success for all students.

To measure the effectiveness of a good prediction algorithm, we need to evaluate its accuracy in predicting test scores and its interpretability in identifying the factors that influence those scores. Accuracy can be measured using standard metrics such as mean squared error or root mean squared error, which compare the predicted test scores to the actual scores. However, accuracy alone is not sufficient to evaluate the effectiveness of a prediction algorithm. We also need to consider the interpretability of the results, which requires us to examine the relationship between the input variables and the predicted test scores.


Data Understanding ↑¶

Our team will be performing exploratory data analysis with an emphasis on visualization and data processing using the dataset found on Kaggle. The EDA will consist of statistical summaries of attributes, visualizing attributes, and providing interpretations and conclusions based on the visualization and data processing.

Import Library ↑¶

In [1]:
# Needed Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline

import seaborn as sns
from tabulate import tabulate
In [2]:
# remove "future" warnings
warnings.simplefilter(action="ignore", category= FutureWarning)

Loading the Data ↑¶

In [3]:
df = pd.read_csv("Expanded_data_with_more_features.csv")
#  select all columns except the first column (index)
df = df[df.columns[1:]]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Gender               30641 non-null  object 
 1   EthnicGroup          28801 non-null  object 
 2   ParentEduc           28796 non-null  object 
 3   LunchType            30641 non-null  object 
 4   TestPrep             28811 non-null  object 
 5   ParentMaritalStatus  29451 non-null  object 
 6   PracticeSport        30010 non-null  object 
 7   IsFirstChild         29737 non-null  object 
 8   NrSiblings           29069 non-null  float64
 9   TransportMeans       27507 non-null  object 
 10  WklyStudyHours       29686 non-null  object 
 11  MathScore            30641 non-null  int64  
 12  ReadingScore         30641 non-null  int64  
 13  WritingScore         30641 non-null  int64  
dtypes: float64(1), int64(3), object(10)
memory usage: 3.3+ MB

Data Description ↑¶

Data Meaning and Types:¶

Feature Description Data Type Value Type (Scale)
Gender Gender of the Student Object (Categorical) Male/Female
EthnicGroup Ethnicity of the Student Object (Categorical) Group A to E
ParentEduc Parent(s) Educational Background Object (Categorical) Some High School to Master's Degree
LunchType School Lunch Type Object (Categorical) Standard or Free/Reduced
TestPrep Test Preparation Course Completed Object (Categorical) Completed or None
ParentMaritalStatus Parent(s) Martial Status Object (Categorical) Married/Single/Windowed/Divorced
PracticeSport How Often Student Practice Sport(s) Object (Categorical) Never/Sometimes/Regularly
IsFirstChild First Child of the Family Object (Categorical) Yes/No
NrSiblings # of Siblings Int64(Continuous) 0 - 7
TransportMeans Transportation Object (Categorical) School Bus or Private
WklyStudyHours # of Hours Study Weekly Object (Categorical) Less than 5 hours; Between 5 and 10 hours; More than 10 hours
MathScore Math Test Score Int64(Continuous) 0 - 100
ReadingScore Reading Test Score Int64(Continuous) 0 - 100
WritingScore Writing Test Score Int64(Continuous) 0 - 100

Data Quality ↑¶

Data Inspection ↑¶

The original (extended) dataset has 30,641 rows and 14 attributes. The "id" attribute was removed early in our data exploration as it is an index and was not of use for our analysis.

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Gender               30641 non-null  object 
 1   EthnicGroup          28801 non-null  object 
 2   ParentEduc           28796 non-null  object 
 3   LunchType            30641 non-null  object 
 4   TestPrep             28811 non-null  object 
 5   ParentMaritalStatus  29451 non-null  object 
 6   PracticeSport        30010 non-null  object 
 7   IsFirstChild         29737 non-null  object 
 8   NrSiblings           29069 non-null  float64
 9   TransportMeans       27507 non-null  object 
 10  WklyStudyHours       29686 non-null  object 
 11  MathScore            30641 non-null  int64  
 12  ReadingScore         30641 non-null  int64  
 13  WritingScore         30641 non-null  int64  
dtypes: float64(1), int64(3), object(10)
memory usage: 3.3+ MB

Missing Values At A Glance ↑¶

  • The data set contained many missing values at first glance. Upon further investigation, we found that many columns of data contain "nan".

    • The nan values are only in categorical variables and could likely be attributed to not having an answer that fits within the given responses. It's hard to determine this since we have no information on how each response was gathered, and from who/how it was gathered.

    • The missing data will of course be handled on a case by case basis, but we will evaluate the missing values to uncover potential reasons that will aid us in understanding how to handle the NANs.

    • In total NAN makes up 37.2% of the data in our dataframe.

Missing Values - Potential Reasons:¶

Ultimately there can be numerous reasons for why there are so many missing values especially when we have limited knowledge on the data gathering and study creation process. Left to our assumptions some possible reasons for the missing values could be the following:

  • Ethnic Group: This could be the result of not fitting into Group A-E, or a choice to not provide the information. or truly missing data.

  • Parent Edu: This could be a result of not fitting into the specific groups (having a PhD for example), or a choice not to provide the information. or truly missing data.

  • TestPrep: This is likely to be a result of not fitting into the specific groups (maybe being "in progress", or "started"/"some"). The reasoning is that this is (likely) not a controversial or discriminatory question in nature and would leave us with a reasonable assumption.

  • ParentMartitalStatus: This is likely to be a result of choosing to not offer a response, or truly missing data. (additionally there is the potential for this to be due to not having the answer choice "never married" meaning the parents are together but never married.)

  • Practice Sport: This could be a result of not fitting into a specific group (there is no answer for "not in sports", or addressing other extra curricular activities, etc).

  • Is First Child: likely to be missing data. Alternatively many instances of twins who don't know who was first or have a "step" sibling of the same age.

  • Transport Means: can be missing data or likely to be result of not fitting into the specific group (no walking/biking option)

  • Wkly Study Hours: likely to be missing data since the scale of options is covered, or maybe even "other" if they participate in an activity that is study related but isn't "independent" studying.

Because we are unsure of the true reasoning it is best to investigate the data itself to see if we can uncover some of the missing data characteristics that will help aid our decision for handling the missing data.

Missing Value Heatmap ↑¶

  • So that we could better understand any underlying patterns or associations in the NAN values found in our data we created a NAN heatmap. The heatmap will:
    • Provide a visual representation of missing data patterns across variables in a dataset.
    • Quickly assess the extent of missing data for each variable.
    • Reveal relationships between missing values across different variables.
    • Determine missingness mechanisms such as MCAR, MAR, MNAR.
In [5]:
#missing data heatmap
plt.figure(figsize=(12,12), dpi = 300)
sns.heatmap(df.isnull(), cmap = "YlGnBu", cbar = False)
plt.show()
In [6]:
# Checking Missing Data
df.isnull().sum()
Out[6]:
Gender                    0
EthnicGroup            1840
ParentEduc             1845
LunchType                 0
TestPrep               1830
ParentMaritalStatus    1190
PracticeSport           631
IsFirstChild            904
NrSiblings             1572
TransportMeans         3134
WklyStudyHours          955
MathScore                 0
ReadingScore              0
WritingScore              0
dtype: int64
In [7]:
# Percentage of missing data
100 * df.isnull().sum() / len(df)
Out[7]:
Gender                  0.000000
EthnicGroup             6.005026
ParentEduc              6.021344
LunchType               0.000000
TestPrep                5.972390
ParentMaritalStatus     3.883685
PracticeSport           2.059332
IsFirstChild            2.950295
NrSiblings              5.130381
TransportMeans         10.228126
WklyStudyHours          3.116739
MathScore               0.000000
ReadingScore            0.000000
WritingScore            0.000000
dtype: float64
In [8]:
def percent_missing(df):
    '''
    INPUT
        df - dataframe
    OUTPUT
        percent_nan - percentage of nulls in dataframe
    '''
    percent_nan = 100* df.isnull().sum()/len(df)
    percent_nan = percent_nan[percent_nan > 0].sort_values()
    return percent_nan

percent_nan = percent_missing(df)
In [9]:
# plot of missing data (percentage) 
plt.figure(dpi = 150)
sns.barplot(x=percent_nan.index, y = percent_nan)
plt.title("Bar Plot of Missing Data (%)")
plt.xlabel("Explanatory Variables")
plt.ylabel("Percent Missing (%)")
plt.xticks(rotation = 45);

- Handling Missing Values¶

After investigation of our data and under the assumption that our data are missing completely at random, there are a number of ways for us to proceed: 1) Discard observations with any missing values 2) Relying on a learning algorithm to deal with missing values in the training phase 3) Impute all missing values before training.

The 2 options discussed for our group:

Option 1 would effectively remove 37.2% of the observations in our data frame. This is quite a substantial loss of information and was ultimately decided against.

Option 3 (imputation). This option seems to make sense given that the data appear to be MCAR, it will allow us to keep the 37% of our data effectively preserving our sample size, and it will help us to maintain more statistical power and precision. It should however be noted that while data imputation is very useful in handling missing values some tradeoffs and consideration are to be had such as:

  • Potential Bias
  • Assumption reliance
  • Loss of Variability
  • Uncertainty estimation

Missing Value Decision Application¶

In [10]:
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

Addressing Duplicates ↑¶

There are no duplicated rows or columns in the data. While there may be certain values (such as scores) which are "duplicated" this is to be expected.

In [11]:
# viewing duplicated data rows/cols
duplicate_sum = df.duplicated().sum()
print("Total Duplicated Rows", duplicate_sum)

duplicate_sum_column = df.duplicated(). sum(axis =0)
print("Total Duplicated Columns", duplicate_sum_column)
Total Duplicated Rows 1
Total Duplicated Columns 1

Addressing Outliers ↑¶

Histogram of Scores¶

To see the overall distribution and outliers of our data we created histograms for the test score variables. Additionally, so that we can make efficient use of our plots, we have overlaid the median score on the histograms so that we can see where that lies within the distribution for each of the test scores.

We chose the median because we found that the data is skewed; therefore, the median is the more appropriate way method.

In [12]:
math_m = df["MathScore"].median()
write_m = df["WritingScore"].median()
read_m = df["ReadingScore"].median()

#math_m, write_m, read_m
m_df = pd.DataFrame({"Math Score Median":[math_m],
                           "Writing Score Median": [write_m],
                           "Reading Score Median":[read_m]})

m_df
Out[12]:
Math Score Median Writing Score Median Reading Score Median
0 67.0 69.0 70.0
In [13]:
# plot histogram and median
fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(12, 12))
a = sns.histplot(x = "ReadingScore",data=df,ax=axes[0,0], bins = 20)
a.axvline(read_m, color = 'r', linestyle = 'dashed', linewidth = 2)
b = sns.histplot(x = "WritingScore",data=df,ax=axes[0,1],bins = 20)
b.axvline(write_m, color = 'r', linestyle = 'dashed', linewidth = 2)
b = sns.histplot(x = "MathScore",data=df,ax=axes[1,0],bins = 20)
b.axvline(math_m, color = 'r', linestyle = 'dashed', linewidth = 2)

axes[0, 0].set_title("Histogram of Reading Score with Median")
axes[0, 1].set_title("Histogram of Writing Score with Median")
axes[1, 0].set_title("Histogram of Math Score with Median")

fig.delaxes(axes[1][1]);

IQR and Identifying Outliers¶

While there are a number of outliers and one notable one is 0 in Math Score, we have decided that it wouldn't make any sense to remove any of them because we believe that every single value in each of these score columns are rather important. We will not remove any data points.

In [14]:
def find_IQR(df,x):
    '''
    INPUT
        df - dataframe to be used
        x - interested column
    OUTPUT
        Q1 - First Quartile
        Q3 - Third Quartile
        IQR - Interquartile Range of interested column
    '''
    # IQR
    Q1 = np.percentile(df[x], 25)
    Q3 = np.percentile(df[x], 75)
    IQR = Q3 - Q1
    return Q1, Q3, IQR
In [15]:
# Math IQR
Q1, Q3, IQR = find_IQR(df, "MathScore")

# Upper Bound
math_upper=Q3+1.5*IQR
mathupper_array=df[np.array(df["MathScore"]>=math_upper)]
In [16]:
# Lower bound
math_lower=Q1-1.5*IQR
mathlower_array=df[np.array(df["MathScore"]<=math_lower)]

mathlower_array
Out[16]:
Gender EthnicGroup ParentEduc LunchType TestPrep ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans WklyStudyHours MathScore ReadingScore WritingScore
17 female group B some high school free/reduced none divorced sometimes yes 1.0 school_bus 5 - 10 18 32 28
55 female group C some high school free/reduced none single regularly yes 3.0 school_bus > 10 0 17 10
136 female group C some college free/reduced none single regularly yes 1.0 school_bus < 5 21 38 33
319 female group B some high school free/reduced none single sometimes no 3.0 school_bus < 5 23 37 26
751 female group B some college standard none single regularly no 4.0 private < 5 18 37 32
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28544 female group C some college free/reduced none single sometimes yes 3.0 private 5 - 10 11 39 34
28597 female group D high school free/reduced completed married sometimes yes 2.0 school_bus > 10 9 32 32
29000 male group A some college free/reduced none married sometimes no 1.0 school_bus < 5 21 18 21
29654 female group B associate's degree free/reduced none single sometimes no 4.0 private 5 - 10 19 41 34
29947 female group B some college free/reduced none married regularly no 1.0 private 5 - 10 23 44 38

93 rows × 14 columns

In [17]:
# Reading IQR
Q1, Q3, IQR = find_IQR(df, "ReadingScore")

# Upper Bound
reading_upper=Q3+1.5*IQR
readingupper_array=df[np.array(df["ReadingScore"]>=reading_upper)]
In [18]:
# Lower bound
reading_lower=Q1-1.5*IQR
readinglower_array=df[np.array(df["ReadingScore"]<=reading_lower)]

readinglower_array
Out[18]:
Gender EthnicGroup ParentEduc LunchType TestPrep ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans WklyStudyHours MathScore ReadingScore WritingScore
55 female group C some high school free/reduced none single regularly yes 3.0 school_bus > 10 0 17 10
69 male group E some high school standard none married sometimes no 5.0 school_bus 5 - 10 28 25 21
308 male group A some college free/reduced none married regularly yes 2.0 school_bus > 10 29 23 19
565 male group B high school free/reduced none married regularly no 2.0 school_bus 5 - 10 30 24 15
928 female group B high school free/reduced none married sometimes yes 1.0 private 5 - 10 8 24 23
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28698 male group C associate's degree free/reduced none married sometimes no 1.0 school_bus 5 - 10 27 26 23
29000 male group A some college free/reduced none married sometimes no 1.0 school_bus < 5 21 18 21
30389 male group B associate's degree free/reduced none divorced never no 1.0 school_bus 5 - 10 24 14 4
30412 male group C some high school free/reduced none single regularly no 1.0 private 5 - 10 33 25 29
30478 male group D associate's degree standard none married sometimes no 1.0 school_bus < 5 32 19 20

90 rows × 14 columns

In [19]:
# Writing IQR
Q1, Q3, IQR = find_IQR(df, "WritingScore")

# Upper Bound
write_upper=Q3+1.5*IQR
writeupper_array=df[np.array(df["WritingScore"]>=write_upper)]
In [20]:
# Lower bound
write_lower=Q1-1.5*IQR
writelower_array=df[np.array(df["WritingScore"]<=write_lower)]

writelower_array
Out[20]:
Gender EthnicGroup ParentEduc LunchType TestPrep ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans WklyStudyHours MathScore ReadingScore WritingScore
55 female group C some high school free/reduced none single regularly yes 3.0 school_bus > 10 0 17 10
69 male group E some high school standard none married sometimes no 5.0 school_bus 5 - 10 28 25 21
308 male group A some college free/reduced none married regularly yes 2.0 school_bus > 10 29 23 19
319 female group B some high school free/reduced none single sometimes no 3.0 school_bus < 5 23 37 26
565 male group B high school free/reduced none married regularly no 2.0 school_bus 5 - 10 30 24 15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29274 female group E some college free/reduced none married sometimes no 1.0 school_bus 5 - 10 30 32 26
29817 male group D high school free/reduced none married sometimes no 1.0 private 5 - 10 33 31 22
30389 male group B associate's degree free/reduced none divorced never no 1.0 school_bus 5 - 10 24 14 4
30478 male group D associate's degree standard none married sometimes no 1.0 school_bus < 5 32 19 20
30630 male group B associate's degree free/reduced none married sometimes no 4.0 private 5 - 10 43 28 24

109 rows × 14 columns

In [21]:
# Create a DataFrame of the scores
scores_df = df[["ReadingScore", "MathScore", "WritingScore"]]

# Melt the DataFrame to "long-form" format
scores_melt = pd.melt(scores_df)

# Create the boxplot using Seaborn
sns.boxplot(x="variable", y="value", data=scores_melt)

# Add labels to the x-axis ticks
plt.xticks([0, 1, 2], ["Reading Scores", "Math Scores", "Writing Scores"])

# Add labels to the y-axis
plt.ylabel("Score")

# Add a title
plt.title("Boxplot of Reading, Math, and Writing Scores")

# Display the plot
plt.show()

Dataset Summary Statistics ↑¶

Analyzing summary statistics, handling outliers, missing values, and duplicated data are essential steps in the data preprocessing phase of conducting EDA. Additionally, this step ensures we are able to have reliable and accurate data to work with and use to conduct experiments where we have a full understanding of the interpretability and application that our data allows.

  • Furthermore this process will aid us in various areas such as:

    • Understanding Data: Summary statistics provide a concise overview of the dataset, enabling data scientists to gain insights, identify patterns, and guide further analysis.
    • Ensuring Data Quality: Handling outliers, missing values, and duplicate data is crucial for maintaining the reliability and integrity of data analysis and modeling processes.
    • Maintaining Model Performance: Properly addressing outliers and missing values improves model stability and accuracy, leading to more reliable predictions.
    • Avoiding Biased Conclusions: By appropriately handling outliers, missing values, and duplicate data, data scientists can minimize bias and ensure that their findings are representative and unbiased.
    • Data Quality Assurance: Cleaning and preprocessing the data effectively ensures that subsequent analyses and models are built on reliable and high-quality data.
    • Improving Efficiency: Dealing with outliers, missing values, and duplicate data streamlines the dataset, making data processing and analysis more efficient, accurate, and computationally effective.

Metrics Defined¶

  • Range: The range is the difference between the highest and the lowest values in a dataset. It gives an indication of how spread out the data is.

  • Mode: The mode is the value that appears most frequently in a dataset. It represents the most common or typical value in the data.

  • Mean: The mean is the average of all values in a dataset. It is calculated by summing up all the values and dividing by the number of values.

  • Median: The median is the middle value of a dataset when it is ordered from lowest to highest (or highest to lowest). It is used as a measure of central tendency when the dataset contains outliers or is not normally distributed.

  • Standard Deviation: The standard deviation is a measure of the spread or dispersion of a dataset. It represents how much the values in a dataset vary from the mean. A higher standard deviation indicates a larger spread of the data.

  • Min: The minimum value is the smallest value in a dataset.

  • 25%: The 25th percentile, also known as the first quartile, is the value below which 25% of the data falls.

  • 50%: The 50th percentile, also known as the median, is the middle value of the dataset.

  • 75%: The 75th percentile, also known as the third quartile, is the value below which 75% of the data falls.

  • Max: The maximum value is the largest value in a dataset.

  • Count: The count is the total number of values in a dataset.

In [22]:
df.describe().transpose()
Out[22]:
count mean std min 25% 50% 75% max
NrSiblings 30641.0 2.087106 1.442665 0.0 1.0 2.0 3.0 7.0
MathScore 30641.0 66.558402 15.361616 0.0 56.0 67.0 78.0 100.0
ReadingScore 30641.0 69.377533 14.758952 10.0 59.0 70.0 80.0 100.0
WritingScore 30641.0 68.418622 15.443525 4.0 58.0 69.0 79.0 100.0
In [23]:
# Calculate the statistics using describe()
reading_stats = round(df["ReadingScore"].describe(),2)
math_stats = round(df["MathScore"].describe(),2)
writing_stats = round(df["WritingScore"].describe(),2)

# Calculate the mode separately
reading_mode = df["ReadingScore"].mode()
math_mode = df["MathScore"].mode()
writing_mode = df["WritingScore"].mode()

# Calculate the Median Separately
reading_median = df["ReadingScore"].median()
math_median = df["MathScore"].median()
writing_median = df["WritingScore"].median()

# Calculate the Range Separately
reading_range = (df["ReadingScore"].max()-df["ReadingScore"].min())
math_range = (df["MathScore"].max()-df["MathScore"].min())
writing_range = (df["WritingScore"].max()-df["WritingScore"].min())


# Add the mode to the statistics DataFrame
reading_stats["Mode"] = reading_mode[0]
math_stats["Mode"] = math_mode[0]
writing_stats["Mode"] = writing_mode[0]

# Add the Median to the statistics DataFrame
reading_stats["Median"] = reading_median
math_stats["Median"] = math_median
writing_stats["Median"] = writing_median

# Add the Range to the statistics DataFrame
reading_stats["Range"] = reading_range
math_stats["Range"] = math_range
writing_stats["Range"] = writing_range

# Combine the statistics into a single DataFrame
stats_df = pd.DataFrame({"Reading Score": reading_stats,
                         "Math Score": math_stats,
                         "Writing Score": writing_stats})

# Convert the DataFrame to a visual table
table = tabulate(stats_df, headers='keys', tablefmt='fancy_grid')

# Print the table
print(table)
╒════════╤═════════════════╤══════════════╤═════════════════╕
│        │   Reading Score │   Math Score │   Writing Score │
╞════════╪═════════════════╪══════════════╪═════════════════╡
│ count  │        30641    │     30641    │        30641    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ mean   │           69.38 │        66.56 │           68.42 │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ std    │           14.76 │        15.36 │           15.44 │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ min    │           10    │         0    │            4    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ 25%    │           59    │        56    │           58    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ 50%    │           70    │        67    │           69    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ 75%    │           80    │        78    │           79    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ max    │          100    │       100    │          100    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ Mode   │           65    │        64    │           67    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ Median │           70    │        67    │           69    │
├────────┼─────────────────┼──────────────┼─────────────────┤
│ Range  │           90    │       100    │           96    │
╘════════╧═════════════════╧══════════════╧═════════════════╛

Significance of the Values Calculated:¶

Range: The range of Reading Score, Math Score, and Writing Score can provide an idea of how spread out the scores are and the variability of the performance of students in these subjects.

Mode: The mode of Reading Score, Math Score, and Writing Score can provide an idea of the most common score achieved by students in these subjects, which can be useful for identifying areas of strength and weakness.

Mean: The mean of Reading Score, Math Score, and Writing Score can provide an idea of the average performance of students in these subjects, which can be useful for comparing the performance of different groups of students.

Median: The median of Reading Score, Math Score, and Writing Score can provide an idea of the central tendency of the scores, which can be useful for identifying outliers and extreme values that may skew the mean.

Standard Deviation: The standard deviation of Reading Score, Math Score, and Writing Score can provide an idea of the spread of the scores around the mean, which can be useful for identifying the variability of the performance of students in these subjects.

Min: The minimum score of Reading Score, Math Score, and Writing Score can provide an idea of the lowest score achieved by students in these subjects.

25%, 50%, and 75%: The quartiles of Reading Score, Math Score, and Writing Score can provide an idea of the spread and distribution of the scores, which can be useful for identifying the performance of students in different percentile ranges.

Max: The maximum score of Reading Score, Math Score, and Writing Score can provide an idea of the highest score achieved by students in these subjects.

Count: The count of Reading Score, Math Score, and Writing Score can provide an idea of the sample size of students who took the test in each subject, which can be useful for making inferences about the population of students.


Top Features | Data Visualization ↑¶

In [24]:
#Number of Missing Values in Each Column
df.isnull().sum()

df3 = df

df3.describe()
df3.head()
Out[24]:
Gender EthnicGroup ParentEduc LunchType TestPrep ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans WklyStudyHours MathScore ReadingScore WritingScore
0 female group C bachelor's degree standard none married regularly yes 3.0 school_bus < 5 71 71 74
1 female group C some college standard none married sometimes yes 0.0 school_bus 5 - 10 69 90 88
2 female group B master's degree standard none single sometimes yes 4.0 school_bus < 5 87 93 91
3 male group A associate's degree free/reduced none married never no 1.0 school_bus 5 - 10 45 56 42
4 male group C some college standard none married sometimes yes 0.0 school_bus 5 - 10 76 78 75

1) Gender | Test Score Visualization ↑¶

Visual Interpretation

  • Looking at the density plot, we can see that the female group distribution has greater (higher) scores than the male group in Reading and Writing. The male group seemed to score higher on Math.

    Looking at the scatterplot, we see an interesting relationship between MathScore vs. Writing & Reading Score, which seems to provide additional evidence to what we witness in the density plot. There are clear divides in the scatterplots between the two genders suggesting that there may be a gender-based difference or correlation in the relationship between the math and reading/writing scores.

Why the Visual is Appropriate:

  • The pairplot function allows us to visualize pairwise relationships between numerical variables in a dataset. Additionally by examining scatter plots we are able to visually assess the strength and direction of correlations between different numerical variables. Lastly, the pairplot includes a density plot which provides a visual representation of the data distribution helping to understand the shape, skewness, and spread of each variable individually.
In [25]:
#Full Bar Plot Comparisons
sns.pairplot(df3[["WritingScore", "ReadingScore","MathScore", "Gender"]], hue= "Gender")
plt.show()
In [26]:
## boxplot
def plot_imputed(col):
    '''
    INPUT:
        col - choose interested column for the x-axis; write with quotations
    '''
    fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (20,15))

    sns.boxplot(ax=axes[0,0], data = df3, x = col, y = "MathScore",)
    sns.boxplot(ax = axes[0,1], data = df3, x = col, y = "ReadingScore")
    sns.boxplot(ax=axes[1,0], data = df3, x = col, y = "WritingScore")
    sns.countplot(ax=axes[1,1], x=col, data=df3)

    axes[0,0].set_title(f"Math Scores by {col}")
    axes[0,1].set_title(f"Reading Scores by {col}")
    axes[1,0].set_title(f"Writing Scores by {col}")
    axes[1,1].set_title(f"{col} Count")

    plt.tight_layout()

    plt.show()

## multivariate plot
def plot_imputed_2var(col, col2):
    '''
    INPUT:
        col - choose interested column for the x-axis; write with quotations
        col2 - choose interested column for the hue; write with quotations
    '''
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 15))        

    sns.boxplot(ax=axes[0, 0],x=col,  y="MathScore",data=df3,hue=col2)
    axes[0, 0].set_title(f"MathScore for {col} Grouped by {col2}")

    p = sns.boxplot(ax=axes[0, 1],x=col,  y="ReadingScore",data=df3,hue=col2)
    axes[0, 1].set_title(f"ReadingScore for ParentEduc Grouped by {col2}")

    p = sns.boxplot(ax=axes[1, 0],x=col,  y="WritingScore",data=df3,hue=col2)
    axes[1, 0].set_title(f"WritingScore for ParentEduc Grouped by {col2}")

    # customize x-axis tick labels for each subplot
    axes[0, 0].set_xticklabels(axes[0, 0].get_xticklabels(), rotation=70)
    axes[0, 1].set_xticklabels(axes[0, 1].get_xticklabels(), rotation=70)
    axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=70)
    
    fig.delaxes(axes[1][1])
    
    plt.tight_layout()

    plt.show()

2) Parent Education | Test Score Visualization ↑¶

Visual Interpretation

  • A pattern of increasing median test scores can be observed as "increases" in education level take place. In addition to the observation of this pattern, the boxplots below provide visual representation of the distribution of test scores and show the median, quartiles, and outliers of the data. It provides a quick visual summary of the central tendency, spread, and skewness of the data.



Why the Visual is Appropriate:

  • A box plot is a good visual for comparing education levels against test scores because it displays the distribution of scores for each level of education in a compact and intuitive way. The box represents the interquartile range (IQR) of the data, with the median (50th percentile) represented as a horizontal line inside the box. The whiskers extend to the minimum and maximum values within 1.5 times the IQR, while any data points outside this range are represented as individual dots. This allows for easy comparison of the distribution of test scores for different levels of education, as well as identifying potential outliers or extreme values.

In [27]:
df3["ParentEduc"] = df3["ParentEduc"].astype('category')
df3["ParentEduc"] = df3["ParentEduc"].cat.reorder_categories(["some high school","high school", 
                                                                            "some college", "associate's degree", "bachelor's degree", "master's degree"])

plot_imputed("ParentEduc")

3) Ethnic Groups | Test Score Visualization ↑¶

Visual Interpretation

  • We can see that the median test score increases as we move observe scores for the ethnic groups in alphabetical order. Group E, has a median that seems to increase significantly from group D. The plots below illustrate the distribution of test scores and show the median, quartiles, and outliers of the data. It provides a quick visual summary of the central tendency, spread, and skewness of the data.

Why the Visual is Appropriate:

  • Boxplots are a suitable visual representation for comparing test scores among different ethnic groups A-E because they provide a concise and informative summary of the distribution of the data, allowing for effective comparisons and identification of patterns or differences between the groups.

In [28]:
df3['EthnicGroup'] = df3['EthnicGroup'].astype('category')
df3["EthnicGroup"].cat.reorder_categories(["group A","group B", "group C", "group D", "group E"])


plot_imputed("EthnicGroup")

4) Test Prep | Test Score Visualization ↑¶

Visual Interpretation

  • As expected from an intuitive perspective, those who have completed their test prep have a higher median score than those who did none of the test prep. The box plots created visually show the distribution of test score and shows the median, quartiles, and outliers of the data. It provides a quick visual summary of the central tendency, spread, and skewness of the data.

Why the Visual is Appropriate:

  • A box plot is an appropriate visual for comparing the test scores for groups who did/didn't participate in test prep. The graphic provides a clear and concise summary of the distribution of the data for each group. By plotting the boxplots side by side, we can easily compare the central tendency, spread, and skewness of the test scores for each group. Additionally, boxplots can effectively display potential outliers and extreme values in the data, which can be helpful in identifying differences between the groups. Overall, boxplots can provide a useful visual summary of the distribution of the data and help us compare the performance of the test prep groups.
In [29]:
plot_imputed("TestPrep")

5) Lunch Type | Test Score Visualization ↑¶

Visual Interpretation

  • As we can see from our count plot, most students in this data set eat on a standard lunch plan. When we look at the boxplots, we find that those with free/reduced lunch score a significantly lower grade than those with standard lunch. The boxplot below illustrates the distribution of test score and shows the median, quartiles, and outliers of the data. It provides a quick visual summary of the central tendency, spread, and skewness of the data.
In [30]:
plot_imputed("LunchType")

Gender | Scores - Extra¶

After analyzing the data in the Gender | Top Feat. section above, we observed a visually significant pattern between test scores and gender. Females tend to achieve higher scores in reading and writing, while males tend to excel in math. to further visualize this comparative relationship we provided box plot and an additional count plot which suggests a relatively equal distribution of males and females in our dataset.

In [31]:
plot_imputed("Gender")

Exploration of Attribute Relationships ↑¶

Cross tabulation ↑¶

Cross Tabulation Defined:

  • A statistical technique used to explore relationships between two categorical variables. It organizes the data into a table format, with one variable represented by rows and the other variable represented by columns. Each cell in the table shows the count or frequency of occurrences for a combination of categories from both variables. Cross tabulation enables the examination of how the two variables relate to each other and helps identify patterns, dependencies, or associations between the attributes.
In [32]:
# create a function for the mentioned feature above
def pass_fail(score):
    if score >= 70:
        return "Pass"
    else:
        return "Fail"
In [33]:
df["MathResult"] = df["MathScore"].apply(pass_fail)
df["ReadingResult"] = df["ReadingScore"].apply(pass_fail)
df["WritingResult"] = df["WritingScore"].apply(pass_fail)

Math Cross Tabulations¶

In [34]:
cross_tab1 = pd.crosstab(df['ParentEduc'], df['MathResult'], dropna=False)
print("\n",cross_tab1,"\n")

# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab1.sum(axis=1)
conditional_probs = cross_tab1.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab2 = pd.crosstab(df['LunchType'], df['MathResult'], dropna=False)
print("\n",cross_tab2,"\n")

# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab2.sum(axis=1)
conditional_probs = cross_tab2.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab3 = pd.crosstab(df['EthnicGroup'], df['MathResult'], dropna=False)
print("\n",cross_tab3,"\n")

# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab3.sum(axis=1)
conditional_probs = cross_tab3.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab4 = pd.crosstab(df['TestPrep'], df['MathResult'], dropna=False)
print("\n",cross_tab4,"\n")

# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab4.sum(axis=1)
conditional_probs = cross_tab4.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
 MathResult          Fail  Pass
ParentEduc                    
some high school    3662  1855
high school         3528  2159
some college        4794  3684
associate's degree  2896  2654
bachelor's degree   1566  1820
master's degree      814  1209 

Conditional Probabilities:
MathResult              Fail      Pass
ParentEduc                            
some high school    0.663767  0.336233
high school         0.620362  0.379638
some college        0.565464  0.434536
associate's degree  0.521802  0.478198
bachelor's degree   0.462493  0.537507
master's degree     0.402373  0.597627

 MathResult    Fail   Pass
LunchType                
free/reduced  8147   2589
standard      9113  10792 

Conditional Probabilities:
MathResult        Fail      Pass
LunchType                       
free/reduced  0.758849  0.241151
standard      0.457825  0.542175

 MathResult   Fail  Pass
EthnicGroup            
group A      1465   754
group B      3751  2075
group C      6673  4379
group D      4010  3493
group E      1361  2680 

Conditional Probabilities:
MathResult       Fail      Pass
EthnicGroup                    
group A      0.660207  0.339793
group B      0.643838  0.356162
group C      0.603782  0.396218
group D      0.534453  0.465547
group E      0.336798  0.663202

 MathResult   Fail  Pass
TestPrep               
completed    4848  5107
none        12412  8274 

Conditional Probabilities:
MathResult      Fail      Pass
TestPrep                      
completed   0.486991  0.513009
none        0.600019  0.399981
In [35]:
# plot all Math Cross-tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)

cross_tab_plots = [
    (cross_tab1, "Parent Educ"),
    (cross_tab2, "Lunch Type"),
    (cross_tab3, "EthnicGroup"),
    (cross_tab4, "Test Prep")
]

for i, (cross_tab, label) in enumerate(cross_tab_plots):
    ax = axes[i//2, i%2]
    cross_tab.plot(kind='bar',stacked = False, ax =ax)
    ax.set_ylabel("Count")
    ax.set_title(f"Cross Tabulation: {label} Vs Math Result")
    ax.legend(title = "Test Score Result")
    
plt.tight_layout()

plt.show()

Reading Cross Tabulations¶

In [36]:
cross_tab1a = pd.crosstab(df['ParentEduc'], df['ReadingResult'], dropna=False)
print("\n",cross_tab1a,"\n")

# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab1a.sum(axis=1)
conditional_probs = cross_tab1a.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab2a = pd.crosstab(df['LunchType'], df['ReadingResult'], dropna=False)
print("\n",cross_tab2a,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab2a.sum(axis=1)
conditional_probs = cross_tab2a.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab3a = pd.crosstab(df['EthnicGroup'], df['ReadingResult'], dropna=False)
print("\n",cross_tab3a,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab3a.sum(axis=1)
conditional_probs = cross_tab3a.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab4a = pd.crosstab(df['TestPrep'], df['ReadingResult'], dropna=False)
print("\n",cross_tab4a,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab4a.sum(axis=1)
conditional_probs = cross_tab4a.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
 ReadingResult       Fail  Pass
ParentEduc                    
some high school    3308  2209
high school         3127  2560
some college        4267  4211
associate's degree  2524  3026
bachelor's degree   1357  2029
master's degree      670  1353 

Conditional Probabilities:
ReadingResult           Fail      Pass
ParentEduc                            
some high school    0.599601  0.400399
high school         0.549851  0.450149
some college        0.503303  0.496697
associate's degree  0.454775  0.545225
bachelor's degree   0.400768  0.599232
master's degree     0.331191  0.668809

 ReadingResult  Fail   Pass
LunchType                 
free/reduced   6842   3894
standard       8411  11494 

Conditional Probabilities:
ReadingResult      Fail      Pass
LunchType                        
free/reduced   0.637295  0.362705
standard       0.422557  0.577443

 ReadingResult  Fail  Pass
EthnicGroup              
group A        1303   916
group B        3197  2629
group C        5706  5346
group D        3566  3937
group E        1481  2560 

Conditional Probabilities:
ReadingResult      Fail      Pass
EthnicGroup                      
group A        0.587201  0.412799
group B        0.548747  0.451253
group C        0.516287  0.483713
group D        0.475277  0.524723
group E        0.366493  0.633507

 ReadingResult   Fail  Pass
TestPrep                  
completed       3802  6153
none           11451  9235 

Conditional Probabilities:
ReadingResult      Fail      Pass
TestPrep                         
completed      0.381919  0.618081
none           0.553563  0.446437
In [37]:
# plot all Reading Cross-Tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)

cross_tab_plots = [
    (cross_tab1a, "Parent Educ"),
    (cross_tab2a, "Lunch Type"),
    (cross_tab3a, "EthnicGroup"),
    (cross_tab4a, "Test Prep")
]

for i, (cross_tab, label) in enumerate(cross_tab_plots):
    ax = axes[i//2, i%2]
    cross_tab.plot(kind='bar',stacked = False, ax =ax)
    ax.set_ylabel("Count")
    ax.set_title(f"Cross Tabulation: {label} Vs Reading Result")
    ax.legend(title = "Test Score Result")
    
plt.tight_layout()

plt.show()

Writing Cross Tabulations¶

In [38]:
cross_tab1b = pd.crosstab(df['ParentEduc'], df['WritingResult'], dropna=False)
print("\n",cross_tab1b,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab1b.sum(axis=1)
conditional_probs = cross_tab1b.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab2b = pd.crosstab(df['LunchType'], df['WritingResult'], dropna=False)
print("\n",cross_tab2b,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab2b.sum(axis=1)
conditional_probs = cross_tab2b.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab3b = pd.crosstab(df['EthnicGroup'], df['WritingResult'], dropna=False)
print("\n",cross_tab3b,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab3b.sum(axis=1)
conditional_probs = cross_tab3b.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")

cross_tab4b = pd.crosstab(df['TestPrep'], df['WritingResult'], dropna=False)
print("\n",cross_tab4b,"\n")
# Calculate and print the conditional probabilities
total_by_parent_educ = cross_tab4b.sum(axis=1)
conditional_probs = cross_tab4b.div(total_by_parent_educ, axis=0)
print(f"Conditional Probabilities:\n{conditional_probs}")
 WritingResult       Fail  Pass
ParentEduc                    
some high school    3542  1975
high school         3417  2270
some college        4414  4064
associate's degree  2612  2938
bachelor's degree   1366  2020
master's degree      635  1388 

Conditional Probabilities:
WritingResult           Fail      Pass
ParentEduc                            
some high school    0.642016  0.357984
high school         0.600844  0.399156
some college        0.520642  0.479358
associate's degree  0.470631  0.529369
bachelor's degree   0.403426  0.596574
master's degree     0.313890  0.686110

 WritingResult  Fail   Pass
LunchType                 
free/reduced   7189   3547
standard       8797  11108 

Conditional Probabilities:
WritingResult      Fail      Pass
LunchType                        
free/reduced   0.669616  0.330384
standard       0.441949  0.558051

 WritingResult  Fail  Pass
EthnicGroup              
group A        1351   868
group B        3391  2435
group C        6094  4958
group D        3486  4017
group E        1664  2377 

Conditional Probabilities:
WritingResult      Fail      Pass
EthnicGroup                      
group A        0.608833  0.391167
group B        0.582046  0.417954
group C        0.551393  0.448607
group D        0.464614  0.535386
group E        0.411779  0.588221

 WritingResult   Fail  Pass
TestPrep                  
completed       3548  6407
none           12438  8248 

Conditional Probabilities:
WritingResult      Fail      Pass
TestPrep                         
completed      0.356404  0.643596
none           0.601276  0.398724
In [39]:
# plot all Reading Cross-Tabulations
fig, axes = plt.subplots(nrows=2,ncols=2, figsize=(12,10),dpi=200)

cross_tab_plots = [
    (cross_tab1b, "Parent Educ"),
    (cross_tab2b, "Lunch Type"),
    (cross_tab3b, "EthnicGroup"),
    (cross_tab4b, "Test Prep")
]

for i, (cross_tab, label) in enumerate(cross_tab_plots):
    ax = axes[i//2, i%2]
    cross_tab.plot(kind='bar',stacked = False, ax =ax)
    ax.set_ylabel("Count")
    ax.set_title(f"Cross Tabulation: {label} Vs Writing Result")
    ax.legend(title = "Test Score Result")
    
plt.tight_layout()

plt.show()

Cross Tabulation Takeaways:¶

  • Associations: Cross tabulations are helpful in viewing relationships between categorical variables. Examining the counts in each cell of the table allows us to identify patterns and associations behind the variables.
    • For example, looking at Master's Degree and Writing Result we can see how the distribution of Passing scores is more than double that of the failing scores.

  • Conditional Probabilities: Cross-tabulations enable the calculation of conditional probabilities which allow us to determine the likelihood of a particular category occurring given a specific value of the other variable providing us with conditional relationships between variables.
    • Conditional Probabilities have been provided in the tables above.

  • Imbalances: Cross Tabulations can highlight any imbalances in the distribution of categories across variables. If certain categories are overrepresented or underrepresented in specific combinations, it can indicate potential biases, preferences, or discrepancies that require further investigation.
    • A good example of this is the distribution we have among Ethnic Groups. Group A is a very minor component in terms of representation compared to each of the other groups

Correlation Heatmaps ↑¶

Correlation Heatmap Defined:

  • Correlation heatmaps are visual representations that use color gradients to depict the correlation coefficients between pairs of attributes or variables in a dataset. They provide a concise overview of the strength and direction of relationships between variables. Correlation heatmaps are useful in exploring attribute relationships because they allow for quick identification of patterns and dependencies, highlighting variables that are positively or negatively correlated.

Correlation Heatmap Warning:

  • One-hot encoded correlation plots are not extremely appropriate in certain scenarios because of:
    • Loss of information: Because each category is transformed into a separate binary column, it can result in a high-dimensional dataset. These correlations may not provide meaningful insights due to the loss of original categorical information.
    • Misinterpretation of Correlations: True correlative relationships may not be viewed accurately.
    • Magnitude vs Direction: Focus can be on the presence or absence of a specific category, rather than the magnitude or direction of the relationship.

Correlation Heatmap Purpose:

  • A one-hot encoded correlation plot can still provide value by offering a high-level overview of the relationships between attributes. During the process of discovery this can be especially helpful as we look to poke and prod the data in a multitude of ways.
  • A correlation plot that is one-hot encoded can still help identify potential patterns or dependencies between categories within different variables. During this process of discovery this serves as our use function, and not speculate further.
In [40]:
df["EthnicGroup"] = df["EthnicGroup"].astype(object)
df["ParentEduc"] = df["ParentEduc"].astype(object)

# new dfs with/without object type
df_obj = df.select_dtypes(include = "object")
df_num = df.select_dtypes(exclude = "object")

# view levels of columns
for col in df_obj.columns:
    unique_values = df_obj[col].unique()
    #print(f"Unique values in column '{col}':{unique_values}")
In [41]:
# creating dummy var for the object type df. 
# dropping first = True for multicollinearity dummy trap
df_obj = pd.get_dummies(df_obj, drop_first = True)

# combining into one df
cor_df = pd.concat([df_num, df_obj], axis = 1)
In [42]:
plt.figure(figsize=(12,10), dpi=300)
sns.heatmap(cor_df.corr(),cmap = "magma", linecolor='white', linewidths=.75)
plt.show()
In [43]:
# raw data corr viz. 
plt.figure(figsize=(12,8), dpi=200)
sns.heatmap(df.corr(), cmap = "magma", annot=True)
plt.show()
In [44]:
cor_df.corr()["WritingScore"].sort_values(ascending=False)
Out[44]:
WritingScore                    1.000000
ReadingScore                    0.952584
WritingResult_Pass              0.808938
MathScore                       0.807118
ReadingResult_Pass              0.773939
MathResult_Pass                 0.650314
LunchType_standard              0.274305
ParentEduc_master's degree      0.136667
ParentEduc_bachelor's degree    0.112119
EthnicGroup_group E             0.107477
EthnicGroup_group D             0.091160
PracticeSport_regularly         0.056602
WklyStudyHours_> 10             0.039798
IsFirstChild_yes                0.009822
ParentEduc_some college         0.001525
ParentMaritalStatus_widowed     0.001315
ParentMaritalStatus_married    -0.000964
NrSiblings                     -0.001584
TransportMeans_school_bus      -0.004522
ParentMaritalStatus_single     -0.008681
PracticeSport_sometimes        -0.022002
WklyStudyHours_< 5             -0.052162
EthnicGroup_group C            -0.061737
EthnicGroup_group B            -0.079176
ParentEduc_high school         -0.092659
ParentEduc_some high school    -0.145231
TestPrep_none                  -0.282308
Gender_male                    -0.289312
Name: WritingScore, dtype: float64

When determining Writing Score with correlation, we find that there are several factors that can help us.

Firstly, the strongest relationships with the Writing scores are the other two exam scores(Math & Reading). Another strong relationship are the exam results which shouldn't come as a surprise since they're related to the score. Everything else can be classified as weak since their correlation is between 0.274 and -0.28. Some weak relationships that stand out are the gender(male) at -0.289312, TestPrep(none) at -0.282308 and LunchType(standard) at 0.274305.

This seems to affirm our findings in the previous section when we created boxplot to explore any relationship between these variables with the Writing Score.

In [45]:
cor_df.corr()["ReadingScore"].sort_values(ascending=False)
Out[45]:
ReadingScore                    1.000000
WritingScore                    0.952584
MathScore                       0.817825
ReadingResult_Pass              0.812386
WritingResult_Pass              0.773737
MathResult_Pass                 0.657498
LunchType_standard              0.258152
EthnicGroup_group E             0.128716
ParentEduc_master's degree      0.116293
ParentEduc_bachelor's degree    0.087993
EthnicGroup_group D             0.038766
WklyStudyHours_> 10             0.030269
PracticeSport_regularly         0.028254
IsFirstChild_yes                0.012117
ParentMaritalStatus_widowed     0.002603
ParentMaritalStatus_married     0.000164
NrSiblings                     -0.000412
TransportMeans_school_bus      -0.004933
ParentEduc_some college        -0.007873
ParentMaritalStatus_single     -0.008195
PracticeSport_sometimes        -0.008583
EthnicGroup_group C            -0.043759
WklyStudyHours_< 5             -0.049362
EthnicGroup_group B            -0.067535
ParentEduc_high school         -0.069982
ParentEduc_some high school    -0.122773
TestPrep_none                  -0.204724
Gender_male                    -0.237097
Name: ReadingScore, dtype: float64

The Reading Score seems to tell the same story where Math, Writing Scores and Exam Results have the highest/strongest correlation.

The other weak variables are the same, but should be given attention to, even though their values are a little different. We have: Gender(male) at -0.237097, TestPrep(none) at -0.204724 and LunchType(standard) at 0.258152.

In [46]:
cor_df.corr()["MathScore"].sort_values(ascending=False)
Out[46]:
MathScore                       1.000000
ReadingScore                    0.817825
WritingScore                    0.807118
MathResult_Pass                 0.802676
ReadingResult_Pass              0.665192
WritingResult_Pass              0.653592
LunchType_standard              0.367942
EthnicGroup_group E             0.221775
Gender_male                     0.162391
ParentEduc_master's degree      0.100001
ParentEduc_bachelor's degree    0.089675
WklyStudyHours_> 10             0.062945
PracticeSport_regularly         0.061482
EthnicGroup_group D             0.041074
IsFirstChild_yes                0.014809
ParentMaritalStatus_widowed     0.007399
ParentMaritalStatus_married     0.007144
TransportMeans_school_bus       0.002351
NrSiblings                     -0.000538
ParentEduc_some college        -0.004527
ParentMaritalStatus_single     -0.014035
PracticeSport_sometimes        -0.018130
ParentEduc_high school         -0.065967
WklyStudyHours_< 5             -0.078084
EthnicGroup_group C            -0.078949
EthnicGroup_group B            -0.096779
ParentEduc_some high school    -0.121240
TestPrep_none                  -0.134949
Name: MathScore, dtype: float64

The Math Score correlation is slightly different from our previous two(Writing and Reading). A couple of similarities between this and the previous two are that the other two exams scores and its results have the highest correlation with our interested variable.

Next, we find that some weak relationship that should be paid attention to are: LunchType(standard) at 0.367942, EthnicGroup(E) at 0.221775, Gender(male) at 0.162391 and, if we want to add a weaker link, we can explore TestPrep(none) at -0.134949.

Visual Exploration | Extra Plots For Exploring Relationships ↑¶

1) ParentEduc | Gender | Score Visual¶

Now we want to take a look at the scores when ParentEduc is paired with another variable, so we can understand our data better. In this case, we can see how the scores fare when we pair the education level with Gender.

On the writing and reading scores, no matter the education level, it seems that the female scores higher than the male. However, the male scores higher in the math scores across all the education levels.

In [47]:
df3["ParentEduc"] = df3["ParentEduc"].astype('category')
df3["ParentEduc"] = df3["ParentEduc"].cat.reorder_categories(["some high school","high school", 
                                                              "some college", "associate's degree", 
                                                              "bachelor's degree", "master's degree"])

plot_imputed_2var("ParentEduc", "Gender")

2) Ethnic Group | Test Prep | Score Visual¶

Similar to the previous visual analysis, we want to add another variable to see how the scores fare, but, this time, we want to see it with Ethnic Group and Test Prep.

Across the board, we see that those who've completed their test preps achieve a higher score than those who have done none, which shouldn't come as a surprise. It seems like Group E scores the highest overall, except when we examine the Writing Score. Group E and D seem to be neck-and-neck, having roughly the same score with possibly an infinitesimal difference.

In [48]:
df3['EthnicGroup'] = df3['EthnicGroup'].astype('category')
df3["EthnicGroup"].cat.reorder_categories(["group A","group B", "group C", "group D", "group E"])

plot_imputed_2var("EthnicGroup", "TestPrep")

3) Test Prep | Gender | Score Visual¶

Taking Test Prep and Scores and adding Gender, we want to create a multivariate plot with all the variables.

Similar to what we found above, those who've completed the test prep have higher scores. Females have a higher median on reading and writing while male take the prize for math.

In [49]:
plot_imputed_2var("TestPrep", "Gender")

4) ParentMaritalStatus | Score Visual¶

Visual Interpretation

  • There doesn't seem to be much of a difference in the scores regardless of a student's parent's marital status. However, we should note(from the countplot)that there are far more married and very few widowed.

Why the Visual is Appropriate:

  • A box plot is an appropriate visual for comparing the test scores for groups who did/didn't participate in test prep. The graphic provides a clear and concise summary of the distribution of the data for each group. By plotting the boxplots side by side, we can easily compare the central tendency, spread, and skewness of the test scores for each group. Additionally, boxplots can effectively display potential outliers and extreme values in the data, which can be helpful in identifying differences between the groups. Overall, boxplots can provide a useful visual summary of the distribution of the data and help us compare the performance of the test prep groups.
In [50]:
df3["ParentMaritalStatus"] = df3["ParentMaritalStatus"].astype('category')
df3["ParentMaritalStatus"] = df3["ParentMaritalStatus"].cat.reorder_categories(["single", "married", "divorced",
                                                                               "widowed"])

plot_imputed("ParentMaritalStatus")

5) ParentMaritalStatus | LunchType | Score Visual¶

Once again, using ParentMaritalStatus LunchType and above and adding a new one(LunchType), we want to create a multivariate plot to see how the scores behave.

This is interesting because we didn't see much of a difference in score across marital status by itself, but, when we add Lunch Type, we find that those with free/reduced lunches score significantly lower than their counterpart.

In [51]:
plot_imputed_2var("ParentMaritalStatus", "LunchType")

6) Practice Sport | Score Visual¶

Looking at the boxplots, we find that those who practice sports regularly achieve higher test scores across the board and, the less they practice, the lower their test scores.One interesting aspect to note is the number of students who claim to never practice sports. It is less than half of those who practice regularly.

In [52]:
plot_imputed("PracticeSport")

7) Practice Sport |WklyStudyHrs | Score Visual¶

Adding WklyStudyHours to our plot, we don't really discover anything new. Those who study more than 10 hours have a higher median score whether or not a person practices sports, but the difference is miniscule for Reading and Writing. However, the difference is more notable in math as the difference between each category for WklyStudyHours is more noticeable.

In [53]:
plot_imputed_2var("WklyStudyHours", "PracticeSport")

8) WklyStudyHrs | Score Visual¶

The boxplots proved what we expected through intuition. The more an individual studies the higher the score they achieve. This is true across the board. However, we should note that the count for those who study 5-10 hours is much higher than the other two.

In [54]:
plot_imputed("WklyStudyHours")

9) WklyStudyHrs | EthnicGroup | Score Visual¶

This time we would like to create a multivariable plot with our previous 2 variables and EthnicGroup.

As expected, regardless of Ethnic Group, those who've logged more than 10 hours tend to have a higher score than those who've studied for less than 10. It's also interesting to note that Group E has the highest median score across all the plots when comparing each respective hour range of each group. One interesting aspect to note is that in Group A, those who've logged more than 10 hours have a notable difference than those who've logged 5-10 hours. This difference is greater than the difference of all the other groups(comparing 5-10 and >10).

In [55]:
plot_imputed_2var("EthnicGroup", "WklyStudyHours")

Creating Additional Features ↑¶

Features that could be added to the data: 1) School (or geographic location/district)

2) Teacher (or class)

3) Extracurricular Activities (other than sport specifically)

4) Parent Household Income

5) Work with Tutor

6) Teacher offers tutoring

7) Age

8) Participant in "student government"

9) Parent participation in PTA

10) Class or school average GPA

11) Student GPA

Features that can be created from existing features:

1) Pass/Fail for Math/Reading/Writing can be created by creating levels of score data (>=70 : pass, <70 : Fail)

Pass / Fail Variable Creation ↑¶

Pass | Fail Variable

  • The new pass/fail variable adds a binary assessment dimension to the data, allowing for a simplified understanding and analysis of student performance. When it comes to making predictions, having a pass/fail variable instead of a continuous variable can impact the methods we are able to attempt such as:

1) Classification vs. Regression: With a pass/fail variable, the prediction task typically shifts from regression (predicting a continuous variable) to classification (predicting a discrete class label). This change in problem formulation affects the choice of algorithms and evaluation metrics. Classification algorithms such as logistic regression, decision trees, random forests, or support vector machines may be more suitable for predicting pass/fail outcomes.

2) Interpretability: Predicting a pass/fail outcome can be more interpretable and actionable than predicting a continuous variable. This new attribute allows us to gain insights into the factors contributing to pass/fail predictions, identifying which attributes or features have the most significant influence. This interpretability can aid decision-making, intervention strategies, and understanding the factors driving student success.

3) Model evaluation: The evaluation of pass/fail predictions often requires different metrics than those used for continuous variables. Accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) are common metrics used to assess the performance of classification models.

Feature Creation Summary¶

The potential addition of various features to the data, including school, teacher, extracurricular activities, parent household income, and more, can provide valuable insights for analyzing student performance. Furthermore, creating a pass/fail variable based on score data introduces a binary assessment dimension that simplifies the understanding and analysis of student outcomes.


By incorporating these additional features and creating the pass/fail variable, educators and researchers can gain valuable insights into student performance, make informed decisions, and develop targeted interventions to improve educational outcomes.

Exceptional Additional Work ↑¶

Dimensionality Reduction¶

Dimensionality reduction is the process of reducing the number of variables or features in a dataset. This is often done to simplify the data and make it easier to analyze or visualize. The two main types of dimensionality reduction are feature selection and feature extraction. Feature selection involves selecting a subset of the original variables, while feature extraction involves transforming the variables into a new set of variables. Principal Component Analysis (PCA) is a common technique used for feature extraction.

PCA was attempted but as expected it was not very useful/beneficial. PCA is more appropriate when dealing with many numerical features. While one-hot encoding variables is an option to manipulate data into a "numerical" form, this will not result in making PCA a more appropriate choice.

MCA seems to be a more appropriate process to implement and would serve as a more effective alternative. Before proceeding it must be stated compared to PCA, MCA loses the power of the interpretation that we would have had with PCA created variables.

MCA (Multiple Correspondence Analysis) ↑¶

In [56]:
import prince
import altair as alt
In [57]:
# allows us to work with the altair functionality and bypass the row limitation
alt.data_transformers.disable_max_rows()
Out[57]:
DataTransformerRegistry.enable('default')
In [58]:
df = pd.read_csv('Expanded_data_with_more_features.csv') # read in data
df = df.drop("Unnamed: 0", axis = 1)
In [59]:
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
In [60]:
# create a function for the mentioned feature above
def pass_fail(score):
    if score >= 70:
        return "Pass"
    else:
        return "Fail"
In [61]:
df["MathResult"] = df["MathScore"].apply(pass_fail)
df["ReadingResult"] = df["ReadingScore"].apply(pass_fail)
df["WritingResult"] = df["WritingScore"].apply(pass_fail)
In [62]:
selected_columns = ["MathResult","ReadingResult","WritingResult","EthnicGroup",
                    "ParentEduc", "LunchType", "TestPrep", "PracticeSport",
                    "ParentMaritalStatus","TransportMeans","WklyStudyHours",
                    "IsFirstChild", "Gender"]

df[selected_columns] = df[selected_columns].astype('category')
In [63]:
# set seed
np.random.seed(12)

df_dummy = pd.get_dummies(df[selected_columns])

# Perform (MCA) on the dummy dataframe
mca = prince.MCA(n_components=2, random_state=12)
mca = mca.fit(df_dummy)

# Get the column coordinates from MCA
mca_coordinates = mca.column_coordinates(df_dummy)

# Plot the MCA results with labels for categorical variables
plt.figure(figsize=(10, 8))
plt.scatter(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])

# Add labels for categorical variables
for i, (x, y) in enumerate(zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])):
    plt.text(x, y, mca_coordinates.index[i], ha='center', va='center')

plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.title('MCA Plot with Categorical Variable Labels')
plt.grid(True)
plt.show()
APPEND FILE PATH AND RUN ONLY IF NEEDED
¶
In [64]:
# importing module
# import sys
 
# appending a path
#sys.path.append('/Users/dannychang/opt/miniconda3/pkgs/adjusttext-0.7.3.1-py_1/site-packages')
In [65]:
#!conda install adjustText
from adjustText import adjust_text

np.random.seed(12)

df_dummy = pd.get_dummies(df[selected_columns])

# Perform Multiple Correspondence Analysis (MCA) on the dummy dataframe
mca = prince.MCA(n_components=2, random_state=12)
mca = mca.fit(df_dummy)

# Get the column coordinates from MCA
mca_coordinates = mca.column_coordinates(df_dummy)

# Plot the MCA results with labels for categorical variables
plt.figure(figsize=(10, 8))
plt.scatter(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1], color='blue')

# Define base line length
line_length = 0.1  # adjust as needed

# Calculate the median for x and y
median_x = mca_coordinates.iloc[:, 0].median()
median_y = mca_coordinates.iloc[:, 1].median()

# Variables to hold line ends and text objects for adjustment
line_ends_x = []
line_ends_y = []
texts = []

# Add labels for categorical variables with lines for clarity
for i, (x, y) in enumerate(zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1])):
    label = mca_coordinates.index[i]

    # Adjust line length based on density of points
    density = len([point for point in zip(mca_coordinates.iloc[:, 0], mca_coordinates.iloc[:, 1]) 
                   if np.sqrt((x - point[0]) ** 2 + (y - point[1]) ** 2) < line_length])
    adjusted_line_length = line_length * density

    if x <= median_x and y >= median_y:      # Quadrant II
        line_end = (x-adjusted_line_length, y-adjusted_line_length)
        plt.plot([x, line_end[0]], [y, line_end[1]], color='red')
    elif x > median_x and y >= median_y:     # Quadrant I
        line_end = (x+adjusted_line_length, y-adjusted_line_length)
        plt.plot([x, line_end[0]], [y, line_end[1]], color='red')
    elif x <= median_x and y < median_y:     # Quadrant III
        line_end = (x-adjusted_line_length, y+adjusted_line_length)
        plt.plot([x, line_end[0]], [y, line_end[1]], color='red')
    else:                                    # Quadrant IV
        line_end = (x+adjusted_line_length, y+adjusted_line_length)
        plt.plot([x, line_end[0]], [y, line_end[1]], color='red')

    line_ends_x.append(line_end[0])
    line_ends_y.append(line_end[1])
    texts.append(plt.text(line_end[0], line_end[1], label, ha='center', va='center'))

# Adjust labels to avoid overlap
adjust_text(texts, x=line_ends_x, y=line_ends_y)

plt.xlabel('Component 0')
plt.ylabel('Component 1')
plt.title('MCA Plot with Categorical Variable Labels')
plt.grid(True)
plt.show()

MCA Interpretation¶

Multiple Correspondence Analysis (MCA) is a statistical visualization technique used for categorical variables. MCA plots components and highlights possible relationships and patterns in the data. Each point on the plot represents a level of a categorical variable, and the distance between these points represents the relationships between them.

Some key insights we can obtain from an MCA plot are:

  • Proximity of Points: Points (or labels representing levels of categorical variables) close to each other on the plot share similar profiles, which means they have a similar distribution across the categorical variables in the dataset.

  • Proximity to Origin: Points located near the origin do not contribute much to the overall variability in the dataset. In contrast, points located far from the origin are particularly characteristic of the respective dimension they're located on.

  • Dimensions: The axes (dimensions) represent the underlying structure of the data. The first dimension (axis) explains the most variance in the dataset, the second dimension explains the second most variance, and so on.

  • Explained Variance: The total variance explained by the plot can be obtained. This gives us an idea of how much of the total variability of the dataset is represented in the plot.

From our plot we can use this to explore and interpret relationships between variables. An example of this would be examining where our response variables lie on the plot (such as MathResult_Pass or MathResult_Fail) and then using proximity to determine which other variables share similar profiles.

Conclusion | Comprehensive Summary ↑¶

Business Understanding:

The dataset consists of scores from three tests along with personal and socio-economic factors, collected with the goal of identifying relationships between these factors and test scores. The objective is to determine the most significant factors influencing test scores, which can help educators, parents, and students focus on variables that can improve academic performance. The information gathered can inform educational policies, practices, and interventions to support struggling students and enhance overall educational outcomes. Understanding and analyzing the dataset can lead to insights regarding student performance factors, teaching practices, and the impact of socio-economic status on academic outcomes. Evaluating the effectiveness of prediction algorithms involves assessing their accuracy in predicting test scores and their interpretability in identifying influential factors. Both accuracy and interpretability are important in evaluating the effectiveness of a prediction algorithm.

Data Understanding:

  • Missing data:
    After thorough consideration and investigation of the missing data in our dataset, we have decided to proceed with imputing the missing values based on frequency. The dataset initially exhibited a substantial number of missing values, amounting to 37.2% of the data, particularly in categorical variables. While the reasons for missingness remain unclear due to limited information on data collection, we have assumed that the missing data is Missing Completely at Random (MCAR).


    To handle this missing data, we explored various options such as discarding observations with missing values, which would result in a significant loss of information, or relying on learning algorithms to handle missing values during training. However, we ultimately chose to impute the missing values based on frequency.


    By imputing the missing values using the frequency-based approach, we can retain the entirety of our dataset and preserve the sample size, maintaining statistical power and precision. This method involves filling in the missing values with the most frequent category or response in each respective variable. While imputation has its limitations, including potential bias and assumptions, the frequency-based approach is a practical and reasonable choice given the MCAR assumption and the nature of our categorical variables.


    It is important to acknowledge that imputation based on frequency may introduce some uncertainty, as it does not account for the variability of the missing values. Nevertheless, this chosen approach strikes a balance between preserving the integrity of the dataset and mitigating the potential biases associated with missing data.


    In summary, by opting to impute the missing values based on frequency, we aim to address the missingness in our dataset, maximize the utilization of available data, and facilitate robust analyses that can contribute to informed decision-making and a deeper understanding of the relationships and factors influencing the outcomes of interest.


  • Data Summary Statistics
    Analyzing the statistics for Reading Score, Math Score, and Writing Score provides valuable insights for educators, researchers, and policymakers. These statistics enable the identification of areas of improvement and performance trends among students. For instance, if the mean score for Reading Score is significantly lower than that of Math Score and Writing Score, it suggests a need for targeted efforts to enhance reading skills. Similarly, if the standard deviation of Math Score is considerably higher than that of Reading Score and Writing Score, it indicates greater variability in math performance, necessitating investigation and potential interventions.

    The provided statistics, including count, mean, median, standard deviation, minimum, quartiles, maximum, mode, and range, aid in understanding the performance distribution and characteristics of students in these subjects. They assist in comparing different groups, identifying common scores, central tendencies, variability, and the spread of scores. The sample size, indicated by the count, offers insights into the data's representativeness and generalizability.

    By leveraging these statistics, educators can tailor instructional strategies to address specific areas of weakness, researchers can examine factors influencing student performance, and policymakers can design interventions to improve academic outcomes. Understanding the nuances and patterns within the dataset enhances decision-making processes and supports efforts to promote educational excellence for all students.


  • Top Attributes
    The visualizations provide valuable insights into the relationship between different variables and test scores.

    1) Gender: The visualizations indicate that females tend to have higher scores in Reading and Writing, while males score higher in Math. This suggests a potential gender-based difference or correlation in the relationship between math and reading/writing scores.

    2) Parent Education: The visualizations demonstrate that there is an increasing pattern of median test scores as education levels increase. The boxplots provide a clear summary of the distribution of test scores for different levels of education, allowing for easy comparison and identification of central tendency, spread, and skewness.

    3) Ethnic Groups: The visualizations show that the median test scores increase as we move from ethnic groups A to E. The boxplots effectively summarize the distribution of test scores for each ethnic group, enabling comparisons and identification of patterns or differences.

    4) Test Prep: The visualizations confirm that completing test prep is associated with higher median scores compared to those who did not participate. The boxplots provide a concise summary of the distribution of test scores, allowing for easy comparison of central tendency, spread, and skewness between the two groups.

    5) Lunch Type: The visualizations reveal that students with free/reduced lunch tend to score significantly lower than those with a standard lunch plan. The boxplot summarizes the distribution of test scores and provides insights into the central tendency, spread, and skewness of the data.

Overall, these visualizations are appropriate and effective in conveying the relationships between different variables and test scores, providing a quick summary of the data distribution and highlighting important patterns or differences.


  • Dimensionality Reduction
    Dimensionality reduction techniques were explored to simplify the dataset and aid in analysis and visualization. Principal Component Analysis (PCA) was attempted but deemed inappropriate for the dataset due to the presence of categorical variables. Multiple Correspondence Analysis (MCA) was identified as a more appropriate alternative for categorical variables. MCA plots were used to visualize relationships and patterns in the data, with proximity of points indicating similarity in profiles across variables. The dimensions represented the underlying structure of the data, and the explained variance provided insights into the overall variability captured in the plot. MCA facilitated the exploration and interpretation of relationships between variables, allowing for analysis based on proximity and identification of variables with similar profiles.